Abstract-Knowing the structure of a protein helps us to investigate human diseases related to abnormally or improperly folded proteins. This research provides a solution for identifying the misbalance of homotypic and heterotypic contacts at the sequence stage. There are two methods of protein structure prediction: template-based and Ab-initio modeling. The template-based model matches the given sequence against the original sequence, whereas the Ab-initio model calculates the weight of the given sequence and identifies whether it is balanced or not. If the sequence is not in balance, it can be flagged at this initial stage from its calculated weight. Future directions are also provided for researchers on how to achieve maximum accuracy in protein structure prediction.


Index Terms-Ab-initio modeling, AlphaFold2, heterotypic, homotypic, limitations, misbalance, protein structure prediction


I. Introduction


Knowledge and prediction of protein structures help us to understand how proteins work, that is, how these molecules support the human body in daily life. The word ‘protein’ comes from the Greek word ‘proteios’, which means ‘of highest importance’. Individual proteins are categorized by the functions they perform. For example, structural proteins help to determine cell shape and integrity. These proteins also play a vital role in cell reproduction (mitosis and meiosis) and in the immune system, which is in charge of the body's defense and relies on protein structure recognition (for instance, the structural recognition of immunoglobulins). Knowing the protein structure also helps us to understand human diseases related to abnormally or improperly folded proteins. Thus, knowing and modifying these protein structures can revolutionize the medical field.

* Corresponding author: Noman.Khalid.1122.mnk@gmail.com

School of System and Technology, UMT
Volume 2 Issue 1, Spring 2022
Protein Structure Prediction


Linus Pauling [1] first predicted the spiral (helical) structure of proteins in 1936. Afterwards, with the help of technological advancements in biology, scientists identified four levels of protein structure: primary, secondary, tertiary, and quaternary. The primary structure is the sequence of amino acids in the polypeptide chain. The secondary structure is the local spatial arrangement of the polypeptide's backbone (its main chain). The tertiary structure is the three-dimensional structure of the entire polypeptide chain. Lastly, the quaternary structure is the three-dimensional arrangement of subunits in a multi-subunit protein.


Two experimental methods are used to determine protein structure: X-Ray Crystallography and Proton Nuclear Magnetic Resonance (PNMR). These methods help us to visualize the different levels of protein structure. However, the main problem is that the cost of determining a protein structure remains very high. According to the X-Ray Crystallography Facility (XRCF), the average cost of these tests is around $450 [2] per sample.

Several computational models have been developed to predict the secondary structure of proteins. An Ab-initio [3] approach produced two models, namely PaleAle 4.0 with an accuracy of 80.0% and Porter 4.0 with an accuracy of 82.2%. A Bidirectional Recurrent Neural Network (BRNN) [4] was used to predict the secondary structure. DeepCNF [5], which is based on machine learning, predicts the protein structure with an accuracy of 82.3%. DeepCNF is also used to predict the intrinsically disordered regions (IDRs) of proteins, and with AUC-based training [6] the model achieved an accuracy of 84.5%. Spider 3 [7] predicts the secondary structure of proteins by using Long Short Term Memory (LSTM) [8] and Bidirectional Recurrent Neural Networks (BRNNs) [9]. It achieved an accuracy of 83.9%. MUFOLD-SS [10] achieved an accuracy of 88.20% in easy cases and 83.37% in hard cases, where easy cases are those in which the hit value or e-value is <=0.5 and hard cases are those in which it is >0.5. Porter 5.0 [11], an updated version of the Ab-initio Porter 4.0 model, achieved an accuracy of 84.19%. SPOT-1D [12], which uses a Deep Neural Network (DNN) architecture based on recurrent and convolutional methods, achieved an accuracy of 86.18%. NetSurfP-2.0 [13] was developed to predict the secondary structure of proteins from their primary sequence.

All these models predict the one-dimensional secondary structure of proteins, and the accuracies quoted here are based on three (3) class labels. Predicting the tertiary and quaternary 3D structures is more problematic. The challenge is how to visualize the tertiary structure and then combine all the visualized forms to create the quaternary structure.

Several methods have been developed to predict the 3D structure of proteins. DeepMind [14] developed AlphaFold 1 [15], which was entered in CASP 13 [16]. AlphaFold 1 uses a convolutional neural network architecture to predict protein structure, while AlphaFold 2 [17], entered in CASP 14 [18], uses a transformer. The transformer adopts the self-attention mechanism and takes sequential input data. Still, its prediction accuracy is very low when homotypic and heterotypic contacts are misbalanced. This research provides a solution for predicting misbalanced homotypic and heterotypic contacts.
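
The three-class accuracies quoted in this section can be read as the fraction of residues whose label (helix, strand, or coil) is predicted correctly. The following is a minimal sketch of that calculation, assuming one-letter state labels H, E, and C; the function name and the example label strings are hypothetical and are not taken from any of the cited tools.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose three-state label (H, E, or C) matches."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have the same length")
    if not observed:
        raise ValueError("label strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical labels for a 10-residue protein (H = helix, E = strand, C = coil).
observed = "CHHHHCEEEC"
predicted = "CHHHCCEEEC"  # one helix residue is mislabeled as coil
print(f"Three-class accuracy = {q3_accuracy(predicted, observed):.1%}")  # 90.0%
```

Each published model is evaluated on its own test set, so the percentages reported by different papers are not necessarily directly comparable.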


II. Related Work 


Many models have been developed to predict protein structure, as covered in several reviews [19] [20] [21] [22] [23]. Although neural network architectures have long been applied to this problem [19] [22] [23], they have only recently improved protein structure prediction [15] [24] [25] [26]. These approaches follow improvements in computer vision systems [27]. They attempt to fold the tertiary structure of proteins to make the quaternary structure [28] [29] [30], which ultimately creates the 3D structure of proteins. A few models have been developed to predict the protein structure directly [31] [32] [33] [34]. However, these approaches fail to match the accuracy of previous structure prediction pipelines [35]. Still, the success of transformers, which are self-attention based models for language processing [36] and, more recently, for computer vision [37] [38], has led researchers to adopt self-attention based approaches [39] [40] [41].


III. Research Methodology


The goal of this research is to provide a review of existing problems and their solutions for protein structure prediction. We followed a survey methodology designed by various researchers. The research objectives of this paper are as follows:


1. Workings of AlphaFold2 

2. Limitations of AlphaFold2 

3. Solutions for a small 
number of homotypic and a 
large number of 
heterotypic contacts 

4. Feature extraction of 
protein 


A. CASP (Critical Assessment of 
Protein Structure Prediction) 


CASP is a large-scale community experiment that has been conducted every two years since 1994. The sequences of proteins whose structures are being determined experimentally are passed on to the predictors of protein structure. When predictions are made, neither the predictors nor the organizers and assessors know the target structures. These structures are solved by X-Ray Crystallography and PNMR. Afterwards, the entries are held by the PDB (Protein Data Bank).


IV. Results 
A. How AlphaFold 2 Works


AlphaFold2 [17] achieved a median score of 92.4 GDT (Global Distance Test) in CASP 14. This indicates that it can predict protein structure with an error comparable to the width of an atom, and that it remains accurate even on the hardest protein targets. The model was trained on about 170,000 known protein structures. For a problem like protein structure prediction, however, this is a very small number, so its developers also used a much larger dataset of protein sequences with unknown structures. The model thus learned to extract information from unlabeled data, that is, by unsupervised learning, which has enabled many AI breakthroughs. For example, GPT-3 [54] (Generative Pre-trained Transformer) was trained on a huge amount of data collected from the web: it was given part of a sentence and had to predict which words were likely to come next. In another example, part of an image was given to a model, and the model was asked to predict the remaining part of the image.
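
To make the GDT score more concrete, the sketch below computes the commonly used GDT_TS variant: the average, over distance cutoffs of 1, 2, 4, and 8 angstroms, of the percentage of residues whose predicted position lies within that cutoff of the experimental position. It assumes the per-residue distances have already been measured after a single fixed superposition, whereas the full procedure searches over many superpositions; the function name and the example distances are hypothetical.

```python
from typing import Sequence

GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)  # distance cutoffs in angstroms


def gdt_ts(distances: Sequence[float]) -> float:
    """Simplified GDT_TS (0 to 100) from per-residue distances in angstroms.

    Assumption: the distances come from one fixed superposition of the
    predicted and experimental structures.
    """
    if not distances:
        raise ValueError("at least one residue distance is required")
    total = 0.0
    for cutoff in GDT_TS_CUTOFFS:
        within = sum(1 for d in distances if d <= cutoff)
        total += 100.0 * within / len(distances)
    return total / len(GDT_TS_CUTOFFS)


# Hypothetical distances for an 8-residue prediction.
example = [0.4, 0.9, 1.5, 2.5, 3.0, 5.0, 7.5, 9.0]
print(f"GDT_TS = {gdt_ts(example):.3f}")  # (25 + 37.5 + 62.5 + 87.5) / 4 = 53.125
```

On this scale, a score above 90 means that almost every residue falls within even the tightest cutoffs.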


B. Limitations of AlphaFold 2 

AlphaFold2 [17] uses a transformer, a deep learning model based on the self-attention mechanism. However, this model slows down as the sequence length of the protein increases [55]. Another limitation highlighted by the AlphaFold2 [17] team is that its prediction is much weaker for proteins that have a small number of homotypic contacts.
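
The slowdown on long sequences can be illustrated with a minimal self-attention sketch. In standard self-attention every position is scored against every other position, so a sequence of length L needs L x L scores, which is a likely contributor to the cost noted above. The code is a plain-Python illustration under simplifying assumptions: it omits the learned query, key, and value projections and the multiple heads that real transformers use, and it is not AlphaFold2's actual architecture. All names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity projections."""
    dim = len(embeddings[0])
    outputs = []
    for query in embeddings:
        # One score per (query, key) pair: L scores for each of the L queries.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all input vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, embeddings))
             for i in range(dim)]
        )
    return outputs


def pairwise_score_count(sequence_length: int) -> int:
    """Number of attention scores computed for one attention layer."""
    return sequence_length * sequence_length


print(self_attention([[1.0, 0.0], [0.0, 1.0]]))
for length in (100, 1000):
    print(length, "residues ->", pairwise_score_count(length), "scores")
```

Going from 100 to 1,000 residues multiplies the number of scores by 100, not by 10.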
Table I

Version   Year of Publishing   No. of Entries   No. of Sequences   Reference
CASP1     1994                 229              1                  [42]
CASP2     1996                 212              2                  [43]
CASP3     1998                 235              2                  [44]
CASP4     2000                 203              1                  [45]
CASP5     2002                 191              3                  [46]
CASP6     2004                 217              2                  [47]
CASP7     2006                 217              1                  [48]
CASP8     2008                 246              1                  [49]
CASP9     2010                 229              3                  [50]
CASP10    2012                 221              3                  [51]
CASP11    2014                 117              1                  [52]
CASP12    2016                 140              3                  [53]
CASP13    2018                 150              1                  [16]
CASP14    2020                 190              2                  [18]

Multiple experiments were conducted on proteins with a large number of heterotypic contacts [56] using a recent PDB dataset [57]. Homotypic contacts are defined as the attachment of one cell to another identical cell, whereas heterotypic contacts are physical interactions between proteins that have different primary structures. Protein 3D structure prediction follows a sequence of steps. Firstly, the primary structure is predicted based on the number of amino acids in the polypeptide chain. Then, for the secondary structure, the sequence from the primary structure is classified into different parts, namely alpha helix, beta strand, and random coil. Afterwards, in the tertiary structure, the alpha helices, beta strands, and random coils are visualized separately. Finally, in the quaternary structure, all of these visualized forms are folded together to create the 3D structure of the protein.
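
The secondary-structure step described above, in which the sequence is divided into alpha helix, beta strand, and random coil parts, can be sketched as grouping a per-residue label string into contiguous segments. This is only an illustration: it assumes the labels have already been predicted and are encoded as H, E, and C, and the function name and example string are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

STATE_NAMES = {"H": "alpha helix", "E": "beta strand", "C": "random coil"}


def segments(labels: str) -> List[Tuple[str, int, int]]:
    """Return (state name, start, end) runs using 1-based inclusive positions."""
    result = []
    position = 1
    for state, run in groupby(labels):
        length = len(list(run))
        result.append((STATE_NAMES[state], position, position + length - 1))
        position += length
    return result


# Hypothetical per-residue labels for a 9-residue fragment.
for name, start, end in segments("CCHHHHEEC"):
    print(f"{name}: residues {start}-{end}")
```

Each segment could then be passed on separately to the later modeling steps.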


C. How to Calculate Homotypic 
and Heterotypic Contacts 


In advanced metastatic tumors, the lack of tumor suppressor genes allows different (heterotypic) cell types to grow among the normal cell types. For example, if the normal arrangement consists only of epithelial cells, then in an advanced metastatic tumor some other type of cell, such as connective tissue cells, will appear among the epithelial cells [58].


In a template-based model, for 
example, we have a protein whose 
original sequence is: 

Gly-Ala-Pro-Leu-Val-Met-Val- 
Pro-Ala-Cys-Gly-Ala-Pro-Leu- 
Val-Met-Val-Pro-Ala-Cys-Gly- 
Ala-Pro-Leu-Val-Met-Val-Pro- 
Ala-Cys-Gly-Ala-Pro-Leu-Val- 
Met-Val-Pro-Ala-Cys-Gly-Ala- 
Pro-Leu-Val-Met-Val-Pro-Ala-Cys 

The sequence obtained from the 
user is: 

Gly-Trp-Pro-Leu-Val-Met-Val- 
Pro-Ala-Cys-Gly-Ala-Pro-Leu- 
Val-Met-Val-Pro-Ala-Cys-Gly- 
Ala-Pro-Leu-Val-Met-Val-Pro- 
Ala-Cys-Gly-Ala-Pro-Leu-Val- 
Met-Val-Pro-Ala-Cys-Gly-Ala- 
Pro-Leu-Val-Met-Val-Pro-Ala-Cys 

So, we have the original sequence and also the original weight of this sequence. The user sequence is matched against the original sequence. The match shows that the alanine at position 2 has been replaced with tryptophan in the user sequence. Tryptophan is the heaviest of all the essential amino acids, so the substitution automatically changes the weight of the user sequence, making the misbalance of homotypic and heterotypic contacts in the user sequence easy to identify. This problem can be solved by using machine learning algorithms. Ten machine learning algorithms were developed and implemented [59], each tested with K-Fold cross-validation. A feature extraction method for proteins [60] was deployed on a live server. This app requires the protein sequence in FASTA format and automatically creates a CSV file to be used by the machine learning algorithms. The difficulty is that creating a dataset containing the original sequences of all proteins is not possible. In the case of human proteins, however, all the genes have been identified, so this approach can resolve the problem for them.
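
The template-based matching step can be sketched as a position-by-position comparison of the two sequences listed above. This is a minimal illustration, assuming both sequences have the same length and differ only by substitutions; a real template-based pipeline would use a sequence alignment that also handles insertions and deletions. The function and constant names are hypothetical.

```python
from typing import List, Tuple

# The 50-residue example repeats this 10-residue unit five times.
REPEAT = "Gly-Ala-Pro-Leu-Val-Met-Val-Pro-Ala-Cys"
ORIGINAL = "-".join([REPEAT] * 5)
# The user sequence carries Trp instead of Ala at position 2.
USER = ORIGINAL.replace("Gly-Ala", "Gly-Trp", 1)


def find_substitutions(original: str, user: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, original residue, user residue) for each mismatch."""
    original_residues = original.split("-")
    user_residues = user.split("-")
    if len(original_residues) != len(user_residues):
        raise ValueError("this sketch only handles sequences of equal length")
    return [
        (position, expected, found)
        for position, (expected, found)
        in enumerate(zip(original_residues, user_residues), start=1)
        if expected != found
    ]


print(find_substitutions(ORIGINAL, USER))  # [(2, 'Ala', 'Trp')]
```

An empty result would mean that the user sequence matches the original at every position.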

The second method is the Ab-initio model, a computational method of quantum chemistry. In this model, the weight of the protein sequence is calculated to identify if it is in balance or not.

The misbalance was revealed when the user sequence was matched with the protein's real sequence [61]. Weight calculation showed that the user sequence has a higher weight because it has tryptophan in the amino acid chain, while the real sequence has alanine at the same position. Since tryptophan is the heaviest amino acid of all, and alanine is much lighter, the weight of the real amino acid sequence is lower than that of the user's sequence. So, the weights of amino acid chains can be calculated and matched against those of their original sequences, and the errors can be shown numerically.
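
The weight comparison can be sketched numerically as follows. The sketch assumes approximate average molecular weights of the free amino acids (in g/mol) and subtracts one water molecule per peptide bond; the table covers only the residues that occur in the example, and the names used are hypothetical rather than part of the cited method.

```python
# Approximate average molecular weights of free amino acids, in g/mol.
AMINO_ACID_MASS = {
    "Gly": 75.07, "Ala": 89.09, "Pro": 115.13, "Val": 117.15,
    "Leu": 131.17, "Met": 149.21, "Cys": 121.16, "Trp": 204.23,
}
WATER_MASS = 18.015  # one water molecule is released per peptide bond

REPEAT = "Gly-Ala-Pro-Leu-Val-Met-Val-Pro-Ala-Cys"
ORIGINAL = "-".join([REPEAT] * 5)
USER = ORIGINAL.replace("Gly-Ala", "Gly-Trp", 1)  # Ala -> Trp at position 2


def chain_weight(sequence: str) -> float:
    """Approximate weight of a peptide given as hyphen-separated three-letter codes."""
    residues = sequence.split("-")
    total = sum(AMINO_ACID_MASS[residue] for residue in residues)
    return total - WATER_MASS * (len(residues) - 1)


difference = chain_weight(USER) - chain_weight(ORIGINAL)
print(f"original: {chain_weight(ORIGINAL):.2f} g/mol")
print(f"user:     {chain_weight(USER):.2f} g/mol")
print(f"excess:   {difference:.2f} g/mol")  # Trp minus Ala, about 115.14
```

A non-zero excess is the numerical error signal described above. Note that substitutions between residues of similar mass would produce only a small difference, so the weight alone cannot localize every change.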

D. Feature Extraction for Proteins


For the extraction of protein features, the first step is to compute the matrix representation of the input protein query. The protein sequence length is used to build the following:

1. Position Relative Incidence Matrix (PRIM)

2. Reverse Position Relative Incidence Matrix (RPRIM)

3. Accumulative Absolute Position Incidence Vector (AAPIV)

4. Reverse Accumulative Absolute Position Incidence Vector (RAAPIV)

5. Frequency Vector (FV)

A live server was created for the feature extraction of proteins [60], based on Chou's 5-step rule [62]. The server accepts input only in FASTA format.
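
Three of the listed features can be sketched directly from a FASTA record. The definitions used here are assumptions based on how these features are commonly described: FV counts each of the 20 standard residues, AAPIV sums the 1-based positions at which each residue occurs, and RAAPIV does the same on the reversed sequence. The server [60] may compute them differently, and PRIM and RPRIM are omitted. All function names are hypothetical.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def frequency_vector(sequence: str) -> List[int]:
    """FV: number of occurrences of each standard residue."""
    return [sequence.count(residue) for residue in AMINO_ACIDS]


def aapiv(sequence: str) -> List[int]:
    """AAPIV (assumed definition): sum of 1-based positions of each residue."""
    totals = dict.fromkeys(AMINO_ACIDS, 0)
    for position, residue in enumerate(sequence, start=1):
        if residue in totals:
            totals[residue] += position
    return [totals[residue] for residue in AMINO_ACIDS]


def raapiv(sequence: str) -> List[int]:
    """RAAPIV (assumed definition): AAPIV of the reversed sequence."""
    return aapiv(sequence[::-1])


def feature_row(sequence: str) -> List[int]:
    """Concatenate FV, AAPIV, and RAAPIV into one 60-value row."""
    return frequency_vector(sequence) + aapiv(sequence) + raapiv(sequence)


# Hypothetical record: the 10-residue repeat from the earlier example.
records = read_fasta(">demo\nGAPLV\nMVPAC\n")
for name, sequence in records.items():
    print(name, feature_row(sequence))
```

Rows like this, one per sequence, could then be written to a CSV file with Python's csv module for use by the machine learning algorithms.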


V. Discussion and Future Research Work


The current paper briefly discussed protein structure prediction in view of the previous research conducted in this field. The primary structure is predicted based on the number of amino acids in the polypeptide chain. Secondary structure prediction includes the identification of alpha helix, beta strand, and random coil. Tertiary structure includes the visualization of these classes, and in the quaternary structure these visualized forms are merged together to create the final 3D structure of the target protein. AlphaFold2 achieved a median score of about 92 GDT in CASP 14. However, it was also found that the AlphaFold2 algorithm has limitations. If a protein has a small number of homotypic and a large number of heterotypic contacts, the GDT of its prediction is very low. This issue can be resolved by homology based modeling and Ab-initio modeling. In homology based modeling, a protein feature extraction technique was developed on a live server, and 10 machine learning algorithms were created and tested via K-Fold cross-validation. However, it was found that the best method to predict the protein structure is Ab-initio. A solution with one real-time example was proposed regarding how to calculate the weight of protein sequences and identify those not in balance. On the basis of this technique, protein folding and repairing techniques can also be applied to achieve the maximum GDT in protein structure prediction. Ab-initio modeling is preferable because it is based on calculating rather than predicting the structure (based on CASP).